Robust distance correlation for variable screening
High-dimensional data are commonly seen in modern statistical applications, and
variable selection methods play indispensable roles in identifying the critical
features for scientific discoveries. Traditional best subset selection methods
are computationally intractable with a large number of features, while
regularization methods such as Lasso, SCAD and their variants perform poorly in
ultrahigh-dimensional data due to low computational efficiency and unstable
algorithms. Sure screening methods have become popular alternatives by first
rapidly reducing the dimension using simple measures such as marginal
correlation and then applying a regularization method. A number of screening
methods for different models or problems have been developed; however, none of
them has targeted data with heavy tails, which are another important
characteristic of modern big data. In this paper, we propose a
robust distance correlation ("RDC") based sure screening method to perform
screening in ultrahigh-dimensional regression with heavy-tailed data. The
proposed method shares the same good properties as the original model-free
distance correlation based screening, with the additional merit of robustly
estimating the distance correlation when the data are heavy-tailed, which
improves the model selection performance in screening. We conducted extensive simulations
under different scenarios of heavy tailedness to demonstrate that our proposed
procedure improves feature selection and prediction performance compared with
other existing model-based or model-free screening procedures. We also applied
the method to high-dimensional heavy-tailed RNA sequencing (RNA-seq) data of
The Cancer Genome Atlas (TCGA) pancreatic cancer cohort, and RDC was shown to
outperform the other methods in prioritizing the most essential and
biologically meaningful genes.
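
The screening step described in this abstract (score each feature marginally, then keep only the top-ranked ones before any regularized fit) can be sketched as follows. This is a minimal illustration that uses the ordinary sample distance correlation, not the robust RDC estimator proposed in the paper, whose details are not given in the abstract. The function names `distance_correlation` and `dc_screen` and the `keep` parameter are hypothetical names chosen for this sketch.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise Euclidean distance matrix of the rows of v, double-centered."""
    v = np.asarray(v, dtype=float)
    if v.ndim == 1:
        v = v[:, None]
    # n x n matrix of distances between observations
    d = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1))
    # subtract row and column means, add back the grand mean
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Ordinary (non-robust) sample distance correlation, V-statistic form."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()   # squared sample distance covariance
    dvar_x = (a * a).mean()  # squared sample distance variance of x
    dvar_y = (b * b).mean()  # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom <= 0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dc_screen(X, y, keep):
    """Rank the columns of X by distance correlation with y; keep the top `keep`.

    Returns the indices of the retained columns (best first) and all scores.
    """
    X = np.asarray(X, dtype=float)
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    order = np.argsort(-scores)  # descending by score
    return order[:keep], scores
```

After screening, a regularized method such as the Lasso would be fitted on the retained columns only. A robust variant would replace `distance_correlation` with an estimator that is less sensitive to heavy tails.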
Bayesian indicator variable selection of multivariate response with heterogeneous sparsity for multi-trait fine mapping
Variable selection has played a critical role in contemporary statistics
and scientific discoveries. Numerous regularization and Bayesian variable
selection methods have been developed in the past two decades, but they
mainly target only one response. As more data are being
collected nowadays, it is common to obtain and analyze multiple correlated
responses from the same study. Running a separate regression for each response
ignores their correlation, so multivariate analysis is recommended. Existing
multivariate methods select variables related to all responses without
considering the possible heterogeneous sparsity of different responses, i.e.,
some features may predict only a subset of responses but not the rest. In this
paper, we develop a novel Bayesian indicator variable selection method for
multivariate regression models with a large number of grouped predictors,
targeting multiple correlated responses with possibly heterogeneous sparsity
patterns. The method is motivated by the multi-trait fine mapping problem in
genetics to identify the variants that are causal to multiple related traits.
Our new method features selection at the individual level, at the group level,
and specific to each response. In addition, we propose a new concept of
subset posterior inclusion probability for inference to prioritize predictors
that target subset(s) of responses. Extensive simulations with varying
sparsity and heterogeneity levels and dimensions have shown the advantage of our
method in variable selection and prediction performance as compared to existing
general Bayesian multivariate variable selection methods and Bayesian fine
mapping methods. We also applied our method to a real data example in imaging
genetics and identified important causal variants for brain white matter
structural change in different regions.
Comment: 29 pages, 3 figures